This project investigates the relation between police incident records, time, and location in San Francisco, California during 2018-2020, through analysis in R Markdown using the readr, dplyr, ggplot2, ggrepel, and leaflet packages. Generally speaking, an obvious decrease since early 2020 can be observed in Assault, Larceny Theft, Lost Property, Non-criminal, Other Miscellaneous, Robbery, and Warrant, while Burglary, Motor Vehicle Theft, and Recovered Vehicle have experienced some increase over the same period. Crimes are mostly reported during afternoon hours, especially around 5-8 PM on weekdays and around noon on all days, and many crimes are reported around 12 AM on weekends. Arrests occur mostly during 3-4 PM on Tuesdays and 12-2 PM on Wednesdays. In conclusion, the factor that most affects when a crime or arrest happens is the crime category, while other factors such as year, month, and police district have little impact on when crimes happen.
This project investigates the relation between police incident records, time, and location in San Francisco, California during 2018-2020, based on the San Francisco Police Department open data. Since crime is a major social concern in urban areas, when these unlawful activities happened and were reported can be important for San Francisco as a major U.S. city. Although this project will not reveal why crimes happened at certain times, some assumptions may be offered. The dataset contains police records in San Francisco since 2018, but only data from 2018 to 2020 are included in the analysis. Records missing too many critical values are removed from the analysis. These police records are used as an indicator of unlawful activities, which may be referred to as crime in what follows.
if (!require(readr)) install.packages("readr")
if (!require(dplyr)) install.packages("dplyr")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(ggrepel)) install.packages("ggrepel")
if (!require(leaflet)) install.packages("leaflet")
Examine the dataset.
library(readr)
# path <- "/Users/zhenhuang/Downloads/Final Project/CRIME2018_RECENT.csv"
path <- "/Users/zhenhuang/Downloads/Final Project/Police_Department_Incident_Reports__2018_to_Present.csv"
df <- read_csv(path)
# str(df)
The original column names are a bit redundant; this step simplifies the names of the columns used in the following analysis.
library(tidyverse)
library(dplyr)
df <- df %>%
  rename(Date = `Incident Date`,
         Time = `Incident Time`,
         Year = `Incident Year`,
         DayOfWeek = `Incident Day of Week`,
         Category = `Incident Category`,
         Descript = `Incident Description`,
         PdDistrict = `Police District`,
         Y = Latitude,
         X = Longitude) %>%
  mutate(Time = as.character(Time))
# str(df)
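As mentioned in the introduction, records missing critical values (such as coordinates) can be dropped before analysis. A minimal sketch on a toy data frame; applying the same filter to `df` would use the renamed `X`/`Y` columns:

```r
library(dplyr)

# Toy example: only the first of three records has complete coordinates
toy <- tibble(X = c(-122.40, NA, -122.41),
              Y = c(37.77, 37.78, NA))

# Keep only rows where both longitude and latitude are present
toy_clean <- toy %>% filter(!is.na(X), !is.na(Y))
nrow(toy_clean)  # 1
```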
The graph indicates that the number of daily police records decreased dramatically during 2020, probably because of COVID-19, when most people stayed at home. The months with the lowest daily number of police records are around March to April 2020, when the government issued the quarantine announcement. Generally, the daily number of police records after early 2020 is much lower than before.
library(dplyr)
library(ggplot2)
df_crime_daily <- df %>%
mutate(Date = as.Date(Date, "%Y/%m/%d")) %>%
group_by(Date) %>%
summarize(count = n()) %>%
arrange(Date)
df_crime_daily %>%
ggplot( aes(x = Date, y = count)) +
geom_line(color = "#4EA8DE", size = 0.1) +
geom_smooth() +
labs(x = "Date", y = "Number of Police Records", title = "Daily Police Records in San Francisco (2018–Recent)")
### Interactive Map of Crime Incidents

This map shows the locations of crime incidents during 2018-2020. Clicking a marker on the map pops up the incident details.
library(leaflet)
# data <- df
# data <- df %>% filter(Year %in% c(2018,2019,2020))
data <- df[1:10000, ]  # Testing only: use only the first 10,000 rows (much faster)
data$popup <- paste("<b>Incident #: </b>", data$`Incident Number`,
"<br>", "<b>Category: </b>", data$Category,
"<br>", "<b>Description: </b>", data$Descript,
"<br>", "<b>Day of week: </b>", data$DayOfWeek,
"<br>", "<b>Date: </b>", data$Date,
"<br>", "<b>Time: </b>", data$Time,
"<br>", "<b>PD district: </b>", data$PdDistrict,
"<br>", "<b>Resolution: </b>", data$Resolution,
"<br>", "<b>Longitude: </b>", data$X,
"<br>", "<b>Latitude: </b>", data$Y)
data %>%
leaflet() %>%
addTiles() %>%
addMarkers(lng = ~X, lat = ~Y, popup = data$popup, clusterOptions = markerClusterOptions())
Summarize the data by incident category. According to the list of the most frequent record categories, Larceny Theft accounts for about 30% of all records, the largest share of any category. The ten most frequent categories are, starting from the most frequent: Larceny Theft, Other Miscellaneous, Malicious Mischief, Non-Criminal, Assault, Burglary, Motor Vehicle Theft, Recovered Vehicle, Warrant, and Lost Property.
df_category <- df %>%
select(Category) %>%
group_by(Category) %>%
summarise(Frequency=n()) %>%
mutate(Percentage=(Frequency*100 / sum(Frequency))) %>%
arrange(desc(Frequency))
head(df_category,20)
## # A tibble: 20 x 3
## Category Frequency Percentage
## <chr> <int> <dbl>
## 1 Larceny Theft 142622 30.2
## 2 Other Miscellaneous 34758 7.35
## 3 Malicious Mischief 31074 6.57
## 4 Non-Criminal 29055 6.14
## 5 Assault 28308 5.99
## 6 Burglary 26585 5.62
## 7 Motor Vehicle Theft 21894 4.63
## 8 Recovered Vehicle 17141 3.62
## 9 Warrant 15508 3.28
## 10 Lost Property 14690 3.11
## 11 Fraud 14235 3.01
## 12 Drug Offense 11484 2.43
## 13 Robbery 11071 2.34
## 14 Missing Person 10616 2.25
## 15 Suspicious Occ 9440 2.00
## 16 Disorderly Conduct 8023 1.70
## 17 Offences Against The Family And Children 6602 1.40
## 18 Traffic Violation Arrest 5526 1.17
## 19 Miscellaneous Investigation 4456 0.942
## 20 Other Offenses 4009 0.848
This pie chart shows the percentage that each record category takes of all records.
df_category %>% ggplot(aes(x="", y=Percentage, fill=Category)) + geom_bar(stat="identity") + coord_polar("y")
The following is a bar plot of incident categories with high frequency.
df_category %>%
filter(Frequency > 20000) %>%
ggplot(aes(x= reorder(Category, -Frequency), y=Frequency, fill=Category)) +
geom_bar(stat="identity")
Record number by date and category: according to the graph, Burglary had a peak around May 2020 and has increased somewhat since early 2020; Larceny Theft has the largest variance and shows the same dramatic decrease since early 2020 as the overall trend; Lost Property has also experienced some decrease since early 2020; Warrant had a few peaks during 2019. Generally speaking, an obvious decrease since early 2020 can be observed in Assault, Larceny Theft, Lost Property, Non-criminal, Other Miscellaneous, Robbery, and Warrant, while Burglary, Motor Vehicle Theft, and Recovered Vehicle have experienced some increase since early 2020.
df_var <- df %>%
mutate(Date = as.Date(Date, "%Y/%m/%d")) %>%
group_by(Category,Date) %>%
summarize(count = n())
df_var %>%
ggplot( aes(x = Date, y = count)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Date", y = "Count", title = "Number of Police Records in San Francisco by Category and Date (2018-2020)") +
facet_wrap(~ Category, nrow = 5)
df_var <- df_var %>%
spread( Category, count)
# df_var$Burglary %>% var()
# df_var$`Malicious Mischief` %>% var()
# df_var$`Larceny Theft` %>% var()
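The `spread()` call above still works, but tidyr has superseded it with `pivot_wider()`. A small sketch of the equivalent reshaping on a toy data frame (assuming tidyr >= 1.0):

```r
library(tidyr)
library(dplyr)

counts <- tibble(
  Category = c("Burglary", "Burglary", "Assault"),
  Date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-01")),
  count = c(3L, 5L, 2L)
)

# Equivalent to spread(Category, count): one column per category,
# with NA for date/category combinations that have no records
wide <- counts %>% pivot_wider(names_from = Category, values_from = count)
names(wide)  # "Date" "Burglary" "Assault"
```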
After exploring the dataset, analyze the general trends of when crimes happened, examining the relation between crime and time of day and day of week.
According to the previous list, Larceny Thefts happen most often, so let’s explore more about this category.
Trend of the number of Larceny Thefts by date: as the graph shows, theft frequency drops dramatically during 2020, similar to the trend in the overall daily crime count.
df %>%
filter(Category=="Larceny Theft") %>%
mutate(Date = as.Date(Date, "%Y/%m/%d")) %>%
group_by(Date) %>%
summarize(count = n()) %>%
arrange(Date) %>%
ggplot( aes(x = Date, y = count)) +
geom_line(color = "#4EA8DE", size = 0.1) +
geom_smooth() +
labs(x = "Date of Theft", y = "Number of Thefts", title = "Daily Thefts in San Francisco (2018–Recent)")
To further understand when thefts happened most often, hour and day of week are analyzed. With a heat map of day of week against hour, readers can see at a glance which day of the week and which hours of the day have the highest number of incident records, and contrast them with other time intervals.
This requires separate Hour and Day-of-Week columns: although the dataset has a day-of-week column, the hour still has to be extracted from the Time column, here using the strsplit function.
The heatmap indicates that a larger number of thefts happened between 6 PM and 7 PM on weekdays.
# First, extract hour from time records.
get_hour <- function(x) {
return(as.numeric(strsplit(x,":")[[1]][1]))
}
df_theft_time <- df%>%
filter(Category=="Larceny Theft") %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(DayOfWeek, Hour) %>%
summarize(count = n())
# head(df_theft_time,10)
# Reorder and format
dow_format <- c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
hour_format <- c(paste(c(12,1:11),"AM"), paste(c(12,1:11),"PM"))
df_theft_time$DayOfWeek <- factor(df_theft_time$DayOfWeek, level = rev(dow_format))
df_theft_time$Hour <- factor(df_theft_time$Hour, level = 0:23, label = hour_format)
# create heatmap
df_theft_time %>%
ggplot( aes(x = Hour, y = DayOfWeek, fill = count)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of Thefts in San Francisco by Time of Theft (2018-Recent)") +
scale_fill_gradient(low = "#ffffff", high = "#6930c3")
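The `sapply`/`strsplit` approach works but calls `strsplit` once per row, which is slow on a dataset this size. If the Time strings are always zero-padded "HH:MM" or "HH:MM:SS" (an assumption worth checking first), a vectorized alternative is:

```r
# Vectorized hour extraction: take the first two characters of the
# zero-padded time string and convert to integer
get_hour_fast <- function(x) as.integer(substr(x, 1, 2))

get_hour_fast(c("00:15", "09:30:00", "23:59"))  # 0 9 23
```

This replaces the per-row `sapply(Time, get_hour)` with a single vectorized call, e.g. `mutate(Hour = get_hour_fast(Time))`.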
Use the same heatmap for all crimes.
#excluding 2021, since 2021 has not ended
df_whole <- df %>%
filter(Year %in% c(2018,2019,2020))
df_whole_time <- df_whole %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(DayOfWeek, Hour) %>%
summarize(count = n())
df_whole_time$DayOfWeek <- factor(df_whole_time$DayOfWeek, level = rev(dow_format))
df_whole_time$Hour <- factor(df_whole_time$Hour, level = 0:23, label = hour_format)
df_whole_time %>%
ggplot(aes(x = Hour, y = DayOfWeek, fill = count)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of Police Records in San Francisco by Time (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#6930c3")
To minimize the effect of low-frequency crimes, create a heatmap for crime categories with more than 10,000 records.
df_top_crimes <- df_whole %>%
group_by(Category) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
filter(count>10000)
df_whole_time <- df_whole %>%
filter(Category %in% df_top_crimes$Category) %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(DayOfWeek, Hour) %>%
summarize(count = n())
df_whole_time$DayOfWeek <- factor(df_whole_time$DayOfWeek, level = rev(dow_format))
df_whole_time$Hour <- factor(df_whole_time$Hour, level = 0:23, label = hour_format)
df_whole_time %>%
ggplot(aes(x = Hour, y = DayOfWeek, fill = count)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of High-Frequency Crimes in San Francisco by Time (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#6930c3")
High-frequency crimes are mostly reported during afternoon hours, especially around 5-8 PM on weekdays and around noon on all days, and many crimes are reported around 12 AM on weekends.
The following graphs aim to explain the relation between time/day of week and crime records.
To further discuss crime category as a factor in when crimes are reported, display the heatmap by crime category.
df_whole_time_crime <- df_whole %>%
filter(Category %in% df_top_crimes$Category) %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(Category, DayOfWeek, Hour) %>%
summarize(count = n())
df_whole_time_crime$DayOfWeek <- factor(df_whole_time_crime$DayOfWeek, level = rev(dow_format))
df_whole_time_crime$Hour <- factor(df_whole_time_crime$Hour, level = 0:23, label = hour_format)
# head(df_whole_time_crime)
df_whole_time_crime %>% ggplot( aes(x = Hour, y = DayOfWeek, fill = count)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of Crimes in San Francisco by Category and Time (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
facet_wrap(~ Category, nrow = 6)
The plot above is dominated by Larceny Theft records, hiding the time distribution of the other categories; the counts therefore need normalization.
# df_whole_time_crime <- df_whole_time_crime %>%
# group_by(Category) %>%
# mutate(norm = count/sum(count))
# min-max normalization within each category
df_whole_time_crime %>%
group_by(Category) %>%
mutate(norm = (count-min(count))/(max(count)-min(count))) %>%
ggplot(aes(x = Hour, y = DayOfWeek, fill = norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of Crimes in San Francisco by Time, Normalized by Category (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
facet_wrap(~ Category, nrow = 6)
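Min-max normalization rescales each category's counts to the range [0, 1], so categories with very different volumes become comparable on a single color scale. A worked example on a small vector:

```r
counts <- c(10, 40, 25)

# (x - min) / (max - min) maps min -> 0, max -> 1, others in between
norm <- (counts - min(counts)) / (max(counts) - min(counts))
norm  # 0.0 1.0 0.5
```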
Same as above, but with Police Districts.
# exclude Police Districts that are out of San Francisco
df_whole_time_district <- df_whole %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(PdDistrict, DayOfWeek, Hour) %>%
summarize(count = n()) %>%
group_by(PdDistrict) %>%
filter(!PdDistrict=="Out of SF" ) %>%
mutate(norm = (count-min(count))/(max(count)-min(count)))
df_whole_time_district$DayOfWeek <- factor(df_whole_time_district$DayOfWeek, level = rev(dow_format))
df_whole_time_district$Hour <- factor(df_whole_time_district$Hour, level = 0:23, label = hour_format)
# head(df_whole_time_district)
df_whole_time_district %>%
ggplot( aes(x = Hour, y = DayOfWeek, fill = norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of Crimes in San Francisco by Time, Normalized by Police District (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#5390D9") +
facet_wrap(~ PdDistrict, nrow = 7)
If crime is tied to daily activities, the times at which those activities end may affect when crimes are reported; comparing months may reveal such patterns.
df_whole_time_month <- df_whole %>%
mutate(Month = format(as.Date(Date, "%Y/%m/%d"), "%B"), Hour = sapply(Time, get_hour)) %>%
group_by(Month, DayOfWeek, Hour) %>%
summarize(count = n()) %>%
group_by(Month) %>%
mutate(norm = (count-min(count))/(max(count)-min(count)))
df_whole_time_month$DayOfWeek <- factor(df_whole_time_month$DayOfWeek, level = rev(dow_format))
df_whole_time_month$Hour <- factor(df_whole_time_month$Hour, level = 0:23, label = hour_format)
# Set order of month
df_whole_time_month$Month <- factor(df_whole_time_month$Month,
level = c("January","February","March","April","May","June","July","August","September","October","November","December"))
df_whole_time_month %>%
ggplot( aes(x = Hour, y = DayOfWeek, fill = norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of Police Records in San Francisco by Time, Normalized by Month (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#23A9A9") +
facet_wrap(~ Month, nrow = 6)
Perhaps things changed over years?
df_whole_time_year <- df_whole %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(Year, DayOfWeek, Hour) %>%
summarize(count = n()) %>%
group_by(Year) %>%
mutate(norm = (count-min(count))/(max(count)-min(count)))
df_whole_time_year$DayOfWeek <- factor(df_whole_time_year$DayOfWeek, level = rev(dow_format))
df_whole_time_year$Hour <- factor(df_whole_time_year$Hour, level = 0:23, label = hour_format)
df_whole_time_year %>%
ggplot(aes(x = Hour, y = DayOfWeek, fill = norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of Police Records in San Francisco by Time, Normalized by Year (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#00A375") +
facet_wrap(~ Year, nrow = 3)
#excluding 2021, since 2021 has not ended
#only arrest records
df_whole <- df %>%
filter(Year %in% c(2018,2019,2020) & grepl("Arrest", Resolution))
df_whole %>%
mutate(Date = as.Date(Date, "%Y/%m/%d")) %>%
group_by(Date) %>%
summarize(count = n()) %>%
arrange(Date) %>%
ggplot(aes(x = Date, y = count)) +
geom_line(color = "#4EA8DE", size = 0.1) +
geom_smooth() +
labs(x = "Date of Arrest", y = "Number of Arrests", title = "Number of Arrests in San Francisco (2018–2020)")
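The `grepl("Arrest", Resolution)` filter above keeps any record whose resolution text contains the word "Arrest" (values such as "Cite or Arrest Adult", if present in the data). A quick check of the pattern:

```r
# grepl returns a logical vector: TRUE where the pattern matches
res <- c("Open or Active", "Cite or Arrest Adult", "Unfounded")
grepl("Arrest", res)  # FALSE TRUE FALSE
```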
df_whole_time <- df_whole %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(DayOfWeek, Hour) %>%
summarize(count = n())
df_whole_time$DayOfWeek <- factor(df_whole_time$DayOfWeek, level = rev(dow_format))
df_whole_time$Hour <- factor(df_whole_time$Hour, level = 0:23, label = hour_format)
df_whole_time %>%
ggplot(aes(x = Hour, y = DayOfWeek, fill = count)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of Arrests in San Francisco by Time (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#6930c3")
To minimize the effect of low-frequency arrests, create a heatmap for the 12 crime categories with the highest arrest counts.
df_top_crimes <- df_whole %>%
group_by(Category) %>%
summarize(count = n()) %>%
arrange(desc(count))
df_top_crimes <- df_top_crimes[1:12,]
df_whole_time <- df_whole %>%
filter(Category %in% df_top_crimes$Category) %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(DayOfWeek, Hour) %>%
summarize(count = n())
df_whole_time$DayOfWeek <- factor(df_whole_time$DayOfWeek, level = rev(dow_format))
df_whole_time$Hour <- factor(df_whole_time$Hour, level = 0:23, label = hour_format)
df_whole_time %>%
ggplot(aes(x = Hour, y = DayOfWeek, fill = count)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Number of Frequent Crime Arrests in San Francisco by Time (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#6930c3")
High-frequency arrests are mostly reported during 3-4 PM on Tuesdays and 12-2 PM on Wednesdays.
To further discuss crime category as a factor in when arrests are reported, display the heatmap by crime category.
df_whole_time_crime <- df_whole %>%
filter(Category %in% df_top_crimes$Category) %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(Category, DayOfWeek, Hour) %>%
summarize(count = n())
df_whole_time_crime$DayOfWeek <- factor(df_whole_time_crime$DayOfWeek, level = rev(dow_format))
df_whole_time_crime$Hour <- factor(df_whole_time_crime$Hour, level = 0:23, label = hour_format)
# head(df_whole_time_crime)
df_whole_time_crime %>% ggplot( aes(x = Hour, y = DayOfWeek, fill = count)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Arrests in San Francisco by Category and Time (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
facet_wrap(~ Category, nrow = 6)
# # since Fraud and Lost Property have too many missing data, they will be excluded in the following analysis
# df_top_crimes <- df_top_crimes %>%
# filter(!Category %in% c("Fraud","Lost Property"))
#
# #draw again without Fraud and Lost Property
# df_whole_time_crime <- df_whole %>%
# filter(Category %in% df_top_crimes$Category) %>%
# mutate(Hour = sapply(Time, get_hour)) %>%
# group_by(Category, DayOfWeek, Hour) %>%
# summarize(count = n())
#
# df_whole_time_crime$DayOfWeek <- factor(df_whole_time_crime$DayOfWeek, level = rev(dow_format))
# df_whole_time_crime$Hour <- factor(df_whole_time_crime$Hour, level = 0:23, label = hour_format)
#
# df_whole_time_crime %>% ggplot( aes(x = Hour, y = DayOfWeek, fill = count)) +
# geom_tile() +
# theme(axis.text.x = element_text(angle = 90)) +
# labs(x = "Time", y = "Day of Week", title = "Arrests in San Francisco by Category and Time (2018-2020)") +
# scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
# facet_wrap(~ Category, nrow = 5)
Normalize the arrest counts within each category.
# min-max normalization within each category
df_whole_time_crime %>%
group_by(Category) %>%
mutate(norm = (count-min(count))/(max(count)-min(count))) %>%
ggplot(aes(x = Hour, y = DayOfWeek, fill = norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Arrests in San Francisco by Time, Normalized by Category (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#5E60CE") +
facet_wrap(~ Category, nrow = 6)
Same as above, but with Police Districts.
# exclude Police Districts that are out of San Francisco
df_whole_time_district <- df_whole %>%
mutate(Hour = sapply(Time, get_hour)) %>%
group_by(PdDistrict, DayOfWeek, Hour) %>%
summarize(count = n()) %>%
group_by(PdDistrict) %>%
filter(!PdDistrict=="Out of SF" ) %>%
mutate(norm = (count-min(count))/(max(count)-min(count)))
df_whole_time_district$DayOfWeek <- factor(df_whole_time_district$DayOfWeek, level = rev(dow_format))
df_whole_time_district$Hour <- factor(df_whole_time_district$Hour, level = 0:23, label = hour_format)
# head(df_whole_time_district)
df_whole_time_district %>%
ggplot( aes(x = Hour, y = DayOfWeek, fill = norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "Time", y = "Day of Week", title = "Arrests in San Francisco by Time, Normalized by Police District (2018-2020)") +
scale_fill_gradient(low = "#ffffff", high = "#5390D9") +
facet_wrap(~ PdDistrict, nrow = 7)